Substructure Discovery Using Minimum Description Length Principle and Background Knowledge
Author
Abstract
Discovering conceptually interesting and repetitive substructures in structural data improves the ability to interpret and compress the data. The substructures are evaluated by their ability to describe and compress the original data set using the domain’s background knowledge and the minimum description length (MDL) of the data. Once discovered, the substructure concept is used to simplify the data by replacing instances of the substructure with a pointer to the newly discovered concept. The discovered substructure concepts allow abstraction over detailed structure in the original data. Iteration of the substructure discovery and replacement process constructs a hierarchical description of the structural data in terms of the discovered substructures. This hierarchy provides varying levels of interpretation that can be accessed based on the goals of the data analysis.
The structural data is represented as a labeled graph. A substructure is a connected subgraph within the graphical representation. An instance of a substructure in an input graph is a set of vertices and edges from the input graph that match, graph theoretically, the graphical representation of the substructure. Once interesting substructures are discovered, they can be replaced by a single representative node in the original graph and can be used as part of another substructure definition in a hierarchy of discovered structures.
The minimum description length principle states that the best theory to describe a set of data is the theory that minimizes the description length of the entire data set. The minimum description length of a graph is defined to be the number of bits necessary to completely describe the graph.
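The replacement step described above can be illustrated with a minimal Python sketch. It assumes the instances of a substructure are already known and do not overlap; finding them (subgraph matching) is not shown. The names `LabeledGraph`, `compress`, and the concept label `S1` are hypothetical and do not come from the original system.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledGraph:
    """Undirected labeled graph: vertex id -> label, plus labeled edges."""
    vertices: dict = field(default_factory=dict)  # vertex id -> label
    edges: set = field(default_factory=set)       # (u, v, edge label)

    def add_vertex(self, vid, label):
        self.vertices[vid] = label

    def add_edge(self, u, v, label):
        # Store endpoints in a canonical order so (u, v) and (v, u) coincide.
        a, b = sorted((u, v), key=str)
        self.edges.add((a, b, label))


def compress(graph, instances, concept_label):
    """Replace each instance (a set of vertex ids) with one concept vertex.

    Assumption: instances are given and pairwise disjoint.
    """
    owner = {}  # original vertex id -> id of the concept vertex replacing it
    result = LabeledGraph()
    for i, instance in enumerate(instances):
        new_id = f"{concept_label}#{i}"
        result.add_vertex(new_id, concept_label)
        for vid in instance:
            owner[vid] = new_id
    for vid, label in graph.vertices.items():
        if vid not in owner:
            result.add_vertex(vid, label)
    for u, v, label in graph.edges:
        cu, cv = owner.get(u, u), owner.get(v, v)
        if cu != cv:  # edges internal to an instance are absorbed by the concept
            result.add_edge(cu, cv, label)
    return result


# Toy input: two instances of the pattern A -on- B, both attached to vertex c.
g = LabeledGraph()
for vid, label in [("a1", "A"), ("b1", "B"), ("a2", "A"), ("b2", "B"), ("c", "C")]:
    g.add_vertex(vid, label)
for u, v, label in [("a1", "b1", "on"), ("a2", "b2", "on"),
                    ("b1", "c", "next"), ("b2", "c", "next")]:
    g.add_edge(u, v, label)

compressed = compress(g, [{"a1", "b1"}, {"a2", "b2"}], "S1")
```

Running `compress` again on the compressed graph, with instances that contain `S1` vertices, would give the next level of the hierarchy.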
The theory that best accounts for a collection of data is the one that minimizes I(S) + I(G|S), where S is the discovered substructure, G is the input graph, I(S) is the number of bits required to encode the discovered substructure, and I(G|S) is the number of bits required to encode the input graph G with respect to S.
Although the principle of minimum description length is useful for discovering substructures that maximize compression of the data, scientists often apply knowledge or assumptions of a specific domain to the discovery process. To make the discovery process more powerful across a wide variety of domains, background knowledge has been added to guide the discovery process. This background knowledge is entered in the form of rules for evaluating substructures. Because only the most-favored substructures are kept and expanded, these rules control the discovery process of the system.
For example, in the CAD circuit domain, circuit components can be classified according to their passivity. A component that is not passive is said to be active. The active components are the main driving components, and identifying them is the first step in understanding the main function of the circuit. The component rule assigns relatively higher values to the active components and lower values to the passive components. Once the active components are selected, attention can be focused on the passive components. Similarly, the loop analysis rule favors subcircuits containing loops, since the components in a closed path are generally a part of the subcircuit or form the subcircuit itself. Furthermore, the component complexity rule prefers substructures with the minimum number of distinct components.
The approach has been applied to the domains of chemical compound analysis, scene analysis, CAD circuit analysis, and analysis of artificially generated graphs. The results demonstrate the applicability and significance of the approach in these domains.
1442 Student Abstracts
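As a rough illustration of the evaluation described in the abstract, the Python sketch below scores two candidate substructures by I(S) + I(G|S) and then weights them with two background-knowledge rules. Everything specific here is an assumption made for the example: the bit-counting scheme in `bits`, the `ACTIVE` label set, the way rule values are combined with the MDL cost in `value`, and the toy graph sizes. None of these are taken from the original system, and the loop analysis rule is not covered.

```python
import math
from dataclasses import dataclass

ACTIVE = {"transistor", "opamp"}  # hypothetical set of active component labels


def bits(n_vertices, n_edges, n_vertex_labels, n_edge_labels):
    """Crude description length in bits (an assumed encoding, not the paper's)."""
    vertex_bits = n_vertices * math.log2(max(n_vertex_labels, 2))
    # Each edge names its two endpoints and carries one label.
    per_edge = 2 * math.log2(max(n_vertices, 2)) + math.log2(max(n_edge_labels, 2))
    return vertex_bits + n_edges * per_edge


@dataclass
class Candidate:
    name: str
    sub_size: tuple         # (vertices, edges) of the substructure S
    compressed_size: tuple  # (vertices, edges) of G after replacing instances of S
    labels: list            # component labels appearing in S


def mdl_cost(c, n_vertex_labels, n_edge_labels):
    """I(S) + I(G|S); the compressed graph gains one label for the new concept."""
    i_s = bits(*c.sub_size, n_vertex_labels, n_edge_labels)
    i_g_given_s = bits(*c.compressed_size, n_vertex_labels + 1, n_edge_labels)
    return i_s + i_g_given_s


def component_rule(labels):
    """Higher value when a larger share of the components is active."""
    return 1.0 + sum(label in ACTIVE for label in labels) / len(labels)


def complexity_rule(labels):
    """Higher value when the substructure uses fewer distinct component types."""
    return 1.0 / len(set(labels))


def value(c, n_vertex_labels, n_edge_labels):
    """Assumed combination: rule values divided by MDL cost (higher is better)."""
    rules = component_rule(c.labels) * complexity_rule(c.labels)
    return rules / mdl_cost(c, n_vertex_labels, n_edge_labels)


# Toy input graph: 12 vertices, 14 edges, 4 vertex labels, 2 edge labels.
baseline = bits(12, 14, 4, 2)
candidates = [
    Candidate("A", (3, 3), (4, 2), ["transistor", "resistor", "resistor"]),
    Candidate("B", (2, 1), (10, 12), ["resistor", "capacitor"]),
]
cost = {c.name: mdl_cost(c, 4, 2) for c in candidates}
best = max(candidates, key=lambda c: value(c, 4, 2))
```

In this toy setting candidate A compresses the graph more than candidate B and also contains an active component, so both the MDL cost and the rule values favor it. A discovery loop would keep only the highest-valued candidates and expand them further.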
Related resources
Substructure Discovery Using Minimum Description Length and Background Knowledge
The ability to identify interesting and repetitive substructures is an essential component to discovering knowledge in structural data. We describe a new version of our Subdue substructure discovery system based on the minimum description length principle. The Subdue system discovers substructures that compress the original data and represent structural concepts in the data. By replacing previo...
Substructure Discovery in the SUBDUE System
Because many databases contain or can be embellished with structural information, a method for identifying interesting and repetitive substructures is an essential component to discovering knowledge in such databases. This paper describes the Subdue system, which uses the minimum description length (MDL) principle to discover substructures that compress the database and represent structural co...
Substructure Discovery in the SUBDUE System
Because many databases contain or can be embellished with structural information, a method for identifying interesting and repetitive substructures is an essential component to discovering knowledge in such databases. This paper describes the SUBDUE system, which uses the minimum description length (MDL) principle to discover substructures that compress the database and represent structural con...
Structural Knowledge Discovery in Chemical and Spatio-Temporal Databases
Most current knowledge discovery systems use only attribute-value information. But relational information between objects is also important to the knowledge hidden in today’s databases. Two such domains are chemical structures and domains where objects are related in space and time. Inductive Logic Programming (ILP) discovery systems handle relational data, but require data to be expressed as a...
Graph Based Concept Learning
Concept Learning is a Machine Learning technique in which the learning process is driven by providing positive and negative examples to the learner. From those examples, the learner builds a hypothesis (concept) that describes the positive examples and excludes the negative examples. Inductive Logic Programming (ILP) systems have successfully been used as concept learners. Examples of those are...